Introduction

In this dashboard, we examine customer loyalty programs in the airline industry. Our dataset, sourced from Northern Lights Air (NLA), a fictional Canadian airline, covers customer engagement, flight activity, and demographics.

Link: https://mavenanalytics.io/data-playground?search=airline+loyalty+program Note: Please scroll to the bottom of the page to find the airline loyalty program dataset.

Abstract

This dashboard aims to explore relationships between predictor variables and the total number of flights in airline loyalty programs. We will analyze the data using multiple regression, ridge regression, non-linear regression, and classification techniques.

Here is a brief overview of the variables in the dataset:
Data Dictionary
Variable Description
Loyalty.Number Unique identifier for loyalty program members
Year Year of flight activity
Month Month of flight activity
Total.Flights Total number of flights
Distance Distance traveled in flights (in miles)
Points.Accumulated Loyalty points accumulated
Points.Redeemed Loyalty points redeemed
Dollar.Cost.Points.Redeemed Cost of loyalty points redeemed (in dollars)
Country Country of residence
Province Province or state of residence
City City of residence
Postal.Code Postal code of residence
Gender Gender of member
Education Level of education
Salary Annual salary of member
Marital.Status Marital status of member
Loyalty.Card Type of loyalty card
CLV Customer lifetime value
Enrollment.Type Type of enrollment (e.g., standard, premium)
Enrollment.Year Year of enrollment
Enrollment.Month Month of enrollment
Cancellation.Year Year of cancellation (if applicable)
Cancellation.Month Month of cancellation (if applicable)

Naive Bayes Classification

Model Explanation

This section fits a Naive Bayes model that uses details about loyalty program members, such as where they live, their gender, and how they interact with the loyalty program, to predict their education level. This helps us understand whether certain traits are common among members with different educational backgrounds, and that understanding can in turn be used to tailor the loyalty program to different members' needs.
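As a refresher on the mechanics, Naive Bayes scores each class by multiplying the class prior by the conditional probability of each observed feature, then normalizing. A base-R sketch with made-up priors and likelihoods (purely illustrative values, not estimates from the NLA dataset):

```r
# Hypothetical class priors and per-class feature likelihoods (illustrative
# values only; they are not estimated from the NLA dataset)
prior    <- c(Bachelor = 0.84, Doctor = 0.06, Master = 0.10)
p_gender <- c(Bachelor = 0.50, Doctor = 0.45, Master = 0.55)  # P(feature1 | class)
p_card   <- c(Bachelor = 0.30, Doctor = 0.40, Master = 0.35)  # P(feature2 | class)

# Naive Bayes: posterior is proportional to prior * product of likelihoods
unnorm    <- prior * p_gender * p_card
posterior <- unnorm / sum(unnorm)
round(posterior, 3)   # the Bachelor class dominates here, ~0.81
```

The naiveBayes() function from e1071 estimates these priors and conditional probabilities from the training data and applies the same rule.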

Research Question

How well can demographic factors, geographic factors, and loyalty program status predict the education level of participants in an airline loyalty rewards program?

Data Preparation

The dataset used for the Naive Bayes classification is a combination of Customer_Flight_Activity.csv and Customer_Loyalty_History.csv. It is prepared by first removing incomplete or irrelevant columns such as Cancellation.Month and Cancellation.Year. Then, categorical variables such as Province and City are converted into factors so the model can use them. This preparation ensures the model has clean data to learn from and can be tested properly.
Data preparation steps
library(e1071)  
library(dplyr)
library(caret)  

data <- read.csv("/Users/jitpapneja/Google Drive/TCCC Old/Personal/merged_data_EP.csv")

data$Cancellation.Year <- NULL  
data$Cancellation.Month <- NULL
data <- na.omit(data)  

categorical_vars <- c("Country", "Province", "City", "Postal.Code", "Gender", 
                      "Education", "Marital.Status", "Loyalty.Card", "Enrollment.Type")
data[categorical_vars] <- lapply(data[categorical_vars], factor)

Model Training

Model training steps
set.seed(712)  
index <- createDataPartition(data$Education, p=0.8, list=FALSE)
trainData <- data[index, ]
testData <- data[-index, ]

# Applying Naive Bayes
model <- naiveBayes(Education ~ ., data = trainData)

# Predictions
predictions <- predict(model, testData)

Model Performance

Model Performance
Confusion Matrix and Statistics

                      Reference
Prediction             Bachelor Doctor High School or Below Master
  Bachelor                48766    807                 3643   1428
  Doctor                     40   2510                    6    175
  High School or Below      183      0                   22      1
  Master                    107    171                    0    813

Overall Statistics
                                          
               Accuracy : 0.8882          
                 95% CI : (0.8856, 0.8907)
    No Information Rate : 0.8368          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.4845          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       

Statistics by Class:

                     Class: Bachelor Class: Doctor Class: High School or Below
Sensitivity                   0.9933       0.71961                    0.005993
Specificity                   0.3862       0.99600                    0.996655
Pos Pred Value                0.8924       0.91908                    0.106796
Neg Pred Value                0.9181       0.98252                    0.937588
Prevalence                    0.8368       0.05945                    0.062568
Detection Rate                0.8312       0.04278                    0.000375
Detection Prevalence          0.9313       0.04655                    0.003511
Balanced Accuracy             0.6897       0.85780                    0.501324
                     Class: Master
Sensitivity                0.33637
Specificity                0.99506
Pos Pred Value             0.74519
Neg Pred Value             0.97214
Prevalence                 0.04120
Detection Rate             0.01386
Detection Prevalence       0.01859
Balanced Accuracy          0.66571

Model Accuracy

The accuracy of the Naive Bayes classifier is 0.8882, indicating that approximately 88.82% of the predictions made by the model are correct.

Confusion Matrix and Performance Metrics

The confusion matrix and its accompanying metrics detail how well the Naive Bayes model classifies each education level, including how it balances precision and recall. These statistics are essential for iteratively improving the model's predictive accuracy. The model's ability to differentiate between educational levels, as indicated by the metrics, can provide a foundation for tailored communication strategies and targeted marketing initiatives.

Findings and Discussion

The statistics output accompanying the confusion matrix includes several important metrics:

  1. Sensitivity (True Positive Rate): the proportion of actual positives that are correctly identified.

  2. Specificity (True Negative Rate): the proportion of actual negatives that are correctly identified.

  3. Positive Pred Value: the proportion of positive identifications that were actually correct.

  4. Negative Pred Value: the proportion of negative identifications that were actually correct.

  5. Prevalence: how common the positive class is in the dataset.

  6. Detection Rate: the proportion of all instances that were correctly identified as positive.

  7. Detection Prevalence: the proportion of all instances that were predicted as positive.

  8. Balanced Accuracy: the average of sensitivity and specificity.
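These definitions can be checked directly against the confusion matrix above. A base-R sketch that recomputes the Bachelor-class metrics from the counts in the output:

```r
# Confusion matrix from the output above (rows = predicted, columns = actual)
cm <- matrix(c(48766,   40, 183, 107,    # actual Bachelor
                 807, 2510,   0, 171,    # actual Doctor
                3643,    6,  22,   0,    # actual High School or Below
                1428,  175,   1, 813),   # actual Master
             nrow = 4,
             dimnames = list(Prediction = c("Bachelor", "Doctor", "HS or Below", "Master"),
                             Reference  = c("Bachelor", "Doctor", "HS or Below", "Master")))

total <- sum(cm)
tp <- cm["Bachelor", "Bachelor"]
fn <- sum(cm[, "Bachelor"]) - tp     # actual Bachelor predicted as something else
fp <- sum(cm["Bachelor", ]) - tp     # predicted Bachelor that were actually other classes
tn <- total - tp - fn - fp

sensitivity <- tp / (tp + fn)        # ~0.993
specificity <- tn / (tn + fp)        # ~0.386
ppv         <- tp / (tp + fp)        # ~0.892
prevalence  <- (tp + fn) / total     # ~0.837
accuracy    <- sum(diag(cm)) / total # ~0.888
```

The values reproduce the caret output for the Bachelor class to rounding.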

Examining the confusion matrix produced by the Naive Bayes model, we can see that the model predicts the correct education level of a loyalty program member about 88.8% of the time, based on multiple predictors such as salary, loyalty card type, and enrollment date. Some takeaways from the statistics portion of the confusion matrix: the high sensitivity (0.993) for the Bachelor class indicates that the model did very well at identifying true positives for that class. The model also performed well on the Doctor class, correctly predicting both positives and negatives, as shown by its positive predictive value (0.919) and negative predictive value (0.982). Overall, the model did a good job of predicting the education level of loyalty program members.

Logistic Regression

Model Explanation

This section uses logistic regression to explore whether the year of manufacture can predict the type of electric vehicle: battery electric vehicle (BEV) or plug-in hybrid electric vehicle (PHEV). Logistic regression is apt for this task because it models the probability that a given vehicle falls into a specific category based on its year of manufacture. The method fits a logistic curve that estimates the likelihood of each vehicle type as a function of production year, outputting a probability between 0 and 1. This approach not only predicts the type of electric vehicle but also helps us understand how the characteristics of electric vehicles have evolved over the years, providing insights for manufacturers planning future models and for analysts studying automotive trends.
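The logistic curve itself is p = 1 / (1 + exp(-(b0 + b1 * year))). A quick base-R illustration with placeholder coefficients (these are made-up values, not the fitted estimates):

```r
# Logistic function with placeholder coefficients (illustrative only)
b0 <- 250     # placeholder intercept
b1 <- -0.125  # placeholder slope per model year
year <- c(1997, 2010, 2024)

p <- 1 / (1 + exp(-(b0 + b1 * year)))  # equivalent to plogis(b0 + b1 * year)
round(p, 3)   # probabilities shrink as model year increases, since b1 < 0
```

Whichever factor level glm() treats as "success" is the one whose probability this curve models.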

Research Question

Can we use the year an electric car was made to predict what type of electric vehicle (BEV or PHEV) it is?

Data Preparation

For this method we use the electric vehicle dataset, as we felt the airline loyalty program data was inadequate for logistic regression. To prepare the data for our research question, we first looked at how many types of electric vehicles there are and found two: Battery Electric Vehicle (BEV) and Plug-in Hybrid Electric Vehicle (PHEV). We then encoded the type as a factor so our model can handle it. Finally, we split the data into training and test sets to run the model on.
Data preparation steps (click to expand)
# Read the data
electric = read.csv("/Users/jitpapneja/Google Drive/TCCC Old/Personal/Electric_Vehicle_Population_Data (1).csv")

# View the unique values in Electric Vehicle Type to check the encoding
unique(electric$Electric.Vehicle.Type)

# Encoding 'Electric Vehicle Type' as a factor
electric$Electric.Vehicle.Type <- as.factor(electric$Electric.Vehicle.Type)

# Splitting the dataset into training and testing sets
set.seed(42)
training_rows <- createDataPartition(electric$Electric.Vehicle.Type, p = 0.8, list = FALSE)
train_data <- electric[training_rows, ]
test_data <- electric[-training_rows, ]

Model Training

Model Training
# Fit the logistic regression model
model <- glm(Electric.Vehicle.Type ~ Model.Year, data = train_data, family = binomial())

# Predicting on the test set
predictions <- predict(model, test_data, type = "response")
predicted_classes <- ifelse(predictions > 0.5, 1, 0)

# Actual classes
actual_classes <- as.numeric(test_data$Electric.Vehicle.Type) - 1

Model Performance

Model Performance
# Calculate accuracy
accuracy <- mean(predicted_classes == actual_classes)
print(paste("Accuracy on test set:", accuracy))
[1] "Accuracy on test set: 0.782447361763135"

Call:
glm(formula = Electric.Vehicle.Type ~ Model.Year, family = binomial(), 
    data = train_data)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) 249.659957   4.082609   61.15   <2e-16 ***
Model.Year   -0.124215   0.002021  -61.46   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 148984  on 142292  degrees of freedom
Residual deviance: 145284  on 142291  degrees of freedom
AIC: 145288

Number of Fisher Scoring iterations: 4

Findings and Discussion

Looking at the model summary, the p-value for Model.Year is statistically significant, which means model year is a useful predictor. The number of Fisher Scoring iterations is 4, meaning the model converged after 4 iterations. The accuracy on the test set is 0.782, indicating that the model predicted the correct electric vehicle type from the model year about 78% of the time.
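One further interpretive step: exponentiating the Model.Year coefficient converts it into an odds ratio.

```r
# Odds ratio from the Model.Year estimate reported in the summary above
b1 <- -0.124215
odds_ratio <- exp(b1)
round(odds_ratio, 3)  # ~0.883
# Each additional model year multiplies the odds of the second factor level
# of Electric.Vehicle.Type by ~0.88, i.e. roughly a 12% drop in odds per year
# (which level glm() treats as "success" depends on the factor ordering).
```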

Ridge Regression

Model Explanation

This section uses Ridge Regression to analyze factors that affect flight frequency among loyalty program customers, considering variables like income, past flights, and seasonal preferences. By applying a penalty to the regression coefficients, it minimizes errors from multicollinearity and balances the model’s complexity. The optimal penalty parameter, λ (lambda), is determined through cross-validation to ensure accuracy without overfitting. This analysis helps identify the main drivers of customer behavior, enabling the airline to optimize its loyalty program for better engagement and effectiveness.
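The penalty can be made concrete with ridge's closed-form solution, beta = (X'X + lambda*I)^(-1) X'y, sketched here in base R on simulated data (glmnet additionally fits an intercept and standardizes internally, so its estimates differ slightly):

```r
set.seed(712)
n <- 200; p <- 3
X <- scale(matrix(rnorm(n * p), n, p))   # scaled predictors, as in our preparation
y <- X %*% c(1.0, 0.7, 0.1) + rnorm(n)   # synthetic response

# Closed-form ridge estimate: beta = (X'X + lambda * I)^(-1) X'y
ridge_beta <- function(X, y, lambda) {
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}

beta_small <- ridge_beta(X, y, lambda = 0.1)
beta_large <- ridge_beta(X, y, lambda = 100)
sum(beta_small^2) > sum(beta_large^2)  # TRUE: larger lambda shrinks coefficients
```

This makes visible why cross-validating lambda matters: it controls how strongly the coefficients are pulled toward zero.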

Research Question

What factors most significantly influence the frequency of flights taken by customers in a loyalty program, and how can these insights be used to optimize the program’s effectiveness and customer engagement?

Data Preparation

To prepare our data for Ridge Regression, we first identified our target and predictor variables. Our target variable is Total Flights, and our predictor variables are Distance, Points Accumulated, and Points Redeemed. Then we scaled our predictor variables, which normalizes the range of the independent variables so that each contributes equally to the analysis. Finally, we placed the predictor variables in a matrix and the target variable in a vector.
Data preparation steps
library(glmnet)

data <- read.csv("/Users/jitpapneja/Google Drive/TCCC Old/Personal/merged_data_EP.csv")

# Predictor Variables and Target
predictors <- data[, c("Distance", "Points.Accumulated", "Points.Redeemed")]
target <- data$Total.Flights  # Now predicting Total Flights

# Scaling predictors
predictors_scaled <- scale(predictors)

# Matrix format conversion
predictors_matrix <- as.matrix(predictors_scaled)
target_vector <- as.vector(target)

Model Training

Model Training Steps
# Set alpha for Ridge Regression
alpha <- 0

set.seed(712) 
cv_ridge <- cv.glmnet(predictors_matrix, target_vector, alpha = alpha)
final_model <- glmnet(predictors_matrix, target_vector, alpha = alpha, lambda = cv_ridge$lambda.min)

Model Performance

Model Performance
Optimal lambda: 0.178239 
4 x 1 sparse Matrix of class "dgCMatrix"
                           s0
(Intercept)        1.29488772
Distance           0.97198159
Points.Accumulated 0.72855024
Points.Redeemed    0.07320769

Call:  cv.glmnet(x = predictors_matrix, y = target_vector, alpha = alpha) 

Measure: Mean-Squared Error 

    Lambda Index Measure       SE Nonzero
min 0.1782   100  0.6858 0.005123       3
1se 0.2147    98  0.6891 0.005021       3

R-squared on the training data: 0.8220202

Assumptions

Here we check our assumptions of constant variance and normality. We can see from the residual vs fitted plot that the assumption of constant variance is most likely broken as there is a pattern in the residual vs fitted plot. In our normality check we see that the points somewhat follow the red line in the middle but at the beginning and end of the plot the points are very far away from the line. This also indicates that the assumption of normality is most likely broken.
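The two diagnostic plots described here can be produced along the following lines. This is a self-contained sketch using a plain lm fit on simulated data; for the ridge model, the fitted values would come from predict(final_model, newx = predictors_matrix) instead:

```r
# Simulated stand-in for a fitted model (illustrative only)
set.seed(712)
x <- rnorm(150)
y <- 1 + 2 * x + rnorm(150)
fit <- lm(y ~ x)

# Residuals vs fitted: a visible pattern suggests non-constant variance
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs Fitted")
abline(h = 0, lty = 2)

# Normal Q-Q plot: points straying from the line at the tails suggest non-normality
qqnorm(resid(fit))
qqline(resid(fit), col = "red")
```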

Findings and Discussion

The coefficient path plot shows the paths of the coefficients as the log of lambda changes. Each line represents a coefficient from the model, and its path shows how the coefficient value changes as lambda increases. The cross-validation plot shows the cross-validation curve for selecting lambda: the x-axis is the log of lambda, the y-axis is the mean squared error (MSE) across the cross-validation folds, and each dot represents the MSE for a specific lambda. Our optimal lambda value is 0.178. A small optimal lambda indicates that little shrinkage was needed, which can mean the model is under-penalized and slightly overfit. The R-squared on the training data is 0.822, which is fairly good, although a training-set R-squared can be inflated by overfitting. Overall, the model does a decent job of using Distance, Points Accumulated, and Points Redeemed to predict the total number of flights, but we cannot be certain of the model's validity because the assumptions of constant variance and normality were not met.

kNN Classification

Introduction

This section delves into how the kNN classification model can predict customer loyalty point redemption. Airline loyalty programs are crucial for retaining customers and encouraging repeated business. Understanding patterns in how and when loyalty points are redeemed can help airlines tailor their programs to better meet customer needs, predict future redemptions, and manage capacity effectively. Using the kNN classification model, we aim to assess the proximity of individual customer profiles to identify patterns that suggest a higher likelihood of point redemption. By analyzing key features such as travel frequency, accrued points, and redemption history, we can better target promotional efforts to enhance participation in the loyalty program.
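For intuition about what the model does, the core of kNN fits in a few lines of base R: for each test point, find the k nearest training points by Euclidean distance and take a majority vote. The toy example below is illustrative only; the analysis itself uses FNN::knn.

```r
# Minimal kNN classifier on toy data (illustrative, not the analysis code)
knn_predict <- function(train_x, train_y, test_x, k = 3) {
  apply(test_x, 1, function(pt) {
    d <- sqrt(rowSums(sweep(train_x, 2, pt)^2))  # Euclidean distance to each training row
    votes <- train_y[order(d)[1:k]]              # labels of the k nearest neighbors
    names(which.max(table(votes)))               # majority vote
  })
}

train_x <- rbind(c(0, 0), c(0, 1), c(5, 5), c(5, 6))
train_y <- c("no", "no", "yes", "yes")
test_x  <- rbind(c(0.2, 0.3), c(5.1, 5.4))
knn_predict(train_x, train_y, test_x, k = 3)  # "no" "yes"
```

Because the vote is distance-based, scaling the features matters; features with large numeric ranges otherwise dominate the distance.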

Research Question: Which customer profile attributes are most predictive of loyalty point redemption, and how accurately can the kNN classification model forecast these point redemptions?

Link: https://mavenanalytics.io/data-playground?search=airline+loyalty+program

Data Preparation

Data preparation steps (click to expand)
customer_loyalty_history <- read.csv("/Users/jitpapneja/Google Drive/TCCC Old/Personal/Airline+Loyalty+Program (1)/Customer Loyalty History.csv")
calendar <- read.csv("/Users/jitpapneja/Google Drive/TCCC Old/Personal/Airline+Loyalty+Program 2/Calendar.csv")
customer_flight_activity <- read.csv("/Users/jitpapneja/Google Drive/TCCC Old/Personal/Airline+Loyalty+Program 2/Customer Flight Activity.csv")
merged_data <- merge(customer_flight_activity, customer_loyalty_history, by = "Loyalty.Number")
## Preprocessing of data
library(lubridate)  # for year()
merged_data <- merged_data %>%
  mutate(
    Country = as.factor(Country),
    Province = as.factor(Province),
    Gender = as.factor(Gender),
    Education = as.factor(Education),
    Marital.Status = as.factor(Marital.Status),
    Enrollment.Tenure = year(Sys.Date()) - Enrollment.Year,
    Cancellation.Year = ifelse(is.na(Cancellation.Year), 0, Cancellation.Year),
    Cancellation.Month = ifelse(is.na(Cancellation.Month), 0, Cancellation.Month)
  )

Model Training

Model Training (click to expand)
set.seed(123)
indices <- sample(1:nrow(merged_data), size = 0.7 * nrow(merged_data))
train_df <- merged_data[indices, ]
test_df <- merged_data[-indices, ]
train_df$PointsRedeemedBinary <- ifelse(train_df$Points.Redeemed != 0, 1, 0)
test_df$PointsRedeemedBinary <- ifelse(test_df$Points.Redeemed != 0, 1, 0)
relevant_vars <- c('Loyalty.Card', 'Total.Flights', 'Points.Accumulated', 'Dollar.Cost.Points.Redeemed', 'Salary')
complete_rows_train <- complete.cases(train_df[, relevant_vars])
train_df <- train_df[complete_rows_train, ]
complete_rows_test <- complete.cases(test_df[, relevant_vars])
test_df <- test_df[complete_rows_test, ]
train_features <- data.frame(model.matrix(~ Loyalty.Card + Total.Flights + Points.Accumulated + Dollar.Cost.Points.Redeemed + Salary - 1, data = train_df))
test_features <- data.frame(model.matrix(~ Loyalty.Card + Total.Flights + Points.Accumulated + Dollar.Cost.Points.Redeemed + Salary - 1, data = test_df))
train_labels <- train_df$PointsRedeemedBinary
test_labels <- test_df$PointsRedeemedBinary
k <- 9 
knn_result <- FNN::knn(train = train_features, test = test_features, cl = train_labels, k = k)

Model Performance

Model performance metrics (click to expand)
# Load the required libraries
library(caret)       # confusionMatrix, posPredValue, sensitivity
library(knitr)       # kable
library(kableExtra)  # kable_styling
library(gmodels)     # CrossTable
knn_result <- factor(knn_result, levels = c("0", "1"))
test_labels <- factor(test_labels, levels = c("0", "1"))
conf_matrix <- confusionMatrix(data = knn_result, reference = test_labels)
precision <- posPredValue(conf_matrix$table, positive = "1")
recall <- sensitivity(conf_matrix$table, positive = "1")
f1_score <- (2 * precision * recall) / (precision + recall)
conf_matrix_df <- as.data.frame.matrix(conf_matrix$table)
metrics_df <- data.frame(
  Metric = c("Accuracy", "Precision", "Recall", "F1 Score"),
  Value = c(conf_matrix$overall['Accuracy'], precision, recall, f1_score)
)
confusion_matrix_table <- kable(conf_matrix_df, caption = "Confusion Matrix", align = 'c') %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
performance_metrics_table <- kable(metrics_df, caption = "Performance Metrics", align = c('l', 'c')) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

cross_tab_table <- CrossTable(knn_result, test_labels, prop.chisq = FALSE, prop.t = FALSE, prop.r = FALSE) %>%
  as.data.frame() %>%
  kable(caption = "Cross-Tabulation", align = c('l', 'c')) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Cell Contents
|-------------------------|
|                       N |
|           N / Col Total |
|-------------------------|

Total Observations in Table: 88229

             | test_labels
  knn_result |         0 |         1 | Row Total
-------------|-----------|-----------|-----------
           0 |     82904 |      4980 |     87884
             |     1.000 |     0.938 |
-------------|-----------|-----------|-----------
           1 |        14 |       331 |       345
             |     0.000 |     0.062 |
-------------|-----------|-----------|-----------
Column Total |     82918 |      5311 |     88229
             |     0.940 |     0.060 |
-------------|-----------|-----------|-----------
print(confusion_matrix_table)
Confusion Matrix
0 1
0 82904 4980
1 14 331
print(performance_metrics_table)
Performance Metrics
Metric Value
Accuracy 0.9433973
Precision 0.9594203
Recall 0.0623235
F1 Score 0.1170438
print(cross_tab_table)
Cross-Tabulation
t.x t.y t.Freq prop.row.x prop.row.y prop.row.Freq prop.col.x prop.col.y prop.col.Freq prop.tbl.x prop.tbl.y prop.tbl.Freq
0 0 82904 0 0 0.9433344 0 0 0.9998312 0 0 0.9396457
1 0 14 1 0 0.0405797 1 0 0.0001688 1 0 0.0001587
0 1 4980 0 1 0.0566656 0 1 0.9376765 0 1 0.0564440
1 1 331 1 1 0.9594203 1 1 0.0623235 1 1 0.0037516

Findings and Discussion

Findings and Discussion (click to expand)
  1. Precision: The proportion of positive identifications that were actually correct. The precision value of 0.9594 indicates a high likelihood that when the model predicts a customer will redeem points, this prediction is correct. In this context, if a promotional offer or incentive is provided based on the prediction, high precision ensures that resources are not wasted on customers who are unlikely to redeem points.

  2. Recall: The proportion of actual positives that were identified correctly. The recall value of 0.0623 is quite low, indicating that the model identifies a relatively small proportion of all actual positive cases (customers who did redeem points). This suggests that while the predictions made by the model are usually correct, it misses a large number of customers who would redeem points but were not predicted to do so.

  3. F1 Score: The F1 score, at 0.1170, is quite low, which highlights a problem with the balance between precision and recall in the model. The F1 score is the harmonic mean of precision and recall. A low F1 score in the presence of a high precision and low recall indicates that the cost of false negatives (failing to identify customers who would redeem points) significantly impacts the model’s performance.

In conclusion, the model appears to be rather conservative in predicting point redemption, focusing on being correct when it does make such a prediction (high precision), but at the expense of failing to identify a substantial number of customers who do redeem points (low recall). In this context, this may mean that while the promotions might be efficiently allocated, there are missed opportunities in reaching customers who would engage with the loyalty program if prompted.

Model Accuracy

The accuracy of the kNN model is 0.9434, indicating that approximately 94.34% of the predictions made by the model are correct. However, about 94% of the test cases did not redeem points, so a model that always predicted "no redemption" would score nearly as well; the precision, recall, and F1 score give a fuller picture. In summary, the accuracy tells us the model is overall reliable in its predictions, but it tends to be extremely selective, which results in missed opportunities to identify many customers who would have redeemed points. This selective approach is efficient in not wasting resources but not effective in engaging all potential customers.

Information on Secondary Chart

Introduction

The dataset provided contains detailed information on the population of electric vehicles (EVs) registered within a specific geographic region, presumably the state of Washington, USA, given the frequent appearance of cities and counties from that state in the data. It comprises 177,866 records, each representing a unique electric vehicle identified by a partial Vehicle Identification Number (VIN) in the "VIN (1-10)" column. The dataset encompasses a range of variables including geographic information (County, City, State, Postal Code, Vehicle Location, 2020 Census Tract), vehicle specifics (Model Year, Make, Model, Electric Vehicle Type, Electric Range, Base MSRP), registration details (DOL Vehicle ID), legislative context (Legislative District), and utility information (Electric Utility). Its timeframe spans vehicles manufactured from 1997 up to the 2024 model year, showcasing the growth and development of the electric vehicle market over nearly three decades.

Abstract

Analysis opportunities range from examining EV distribution and density to assessing trends in adoption over time and exploring the relationship between EV usage and legislative or utility districts. This dataset serves as a valuable resource for informing stakeholders, including policymakers, manufacturers, and energy providers, about the current state and future prospects of EV adoption and infrastructure development.

Rationale for this dataset

The secondary dataset was chosen due to compatibility issues encountered with the airline loyalty dataset. MLR (Multiple Linear Regression), logistic regression and loess functions were not suitable for the airline dataset, resulting in low-quality outputs with an R-squared value of 0.05 or lower. Despite attempting various combinations and transformations, the desired results could not be achieved. Therefore, an alternative dataset was sought where MLR and loess analyses could be conducted effectively.

Data Dictionary:

Attribute Description
VIN (1-10) Vehicle identification number first 10 characters.
County Administrative division where the vehicle is registered.
City Urban settlement where the vehicle owner resides.
State US state where the vehicle is registered.
Postal.Code ZIP code for the vehicle owner’s address.
Model.Year Year of manufacture of the vehicle.
Make Manufacturer or brand of the vehicle.
Model Specific model of the electric vehicle.
Electric.Vehicle.Type Classification of electric vehicle, such as Battery Electric Vehicle (BEV).
Clean.Alternative.Fuel.Vehicle Indicates whether the vehicle qualifies as a Clean Alternative Fuel Vehicle.
Electric.Range Maximum distance the vehicle can travel on a single charge.
Base.MSRP Manufacturer’s suggested retail price without any additional features.
Legislative.District Legislative district where the vehicle is registered.
DOL.Vehicle.ID Department of Licensing’s unique identifier for the vehicle.
Vehicle.Location Geographic coordinates of the vehicle’s registered location.
Electric.Utility Utility provider supplying electricity to the vehicle’s location.
2020.Census.Tract Census tract where the vehicle owner resides, based on the 2020 Census.

LOESS (Local Polynomial Regression)

Introduction

In addition to Multiple Linear Regression (MLR), another powerful technique for analyzing the relationship between variables is LOESS (Local Polynomial Regression). LOESS is a non-parametric regression method that allows for more flexible modeling of complex relationships between variables. It works by fitting a smooth curve to the data, locally estimating a polynomial regression model.

Abstract

Using the Electric Vehicle dataset, LOESS could be utilized to explore the potential non-linear relationships between predictors (such as Model Year, MSRP, or Legislative District) and the Electric Range of EVs.

Data Preparation

During the data preparation phase, due to the large size of the dataset, it was necessary to sample from it. Notably, the analysis conducted on the sample closely mirrored the results obtained from the full dataset. This close correspondence between the sample and full dataset results suggests that the sample is likely representative of the entire population. As such, meaningful conclusions can be drawn from the analysis performed on the sampled data.
Data preparation steps (click to expand)
library(tidyverse)
library(ggplot2)
df <- read_csv("/Users/jitpapneja/Google Drive/TCCC Old/Personal/Electric_Vehicle_Population_Data (1).csv")
# Data preprocessing: Impute missing values and filter
missing_values <- sum(is.na(df))
numeric_cols <- sapply(df, is.numeric)
df[, numeric_cols] <- lapply(df[, numeric_cols], function(x) ifelse(is.na(x), mean(x, na.rm = TRUE), x))
categorical_cols <- sapply(df, is.character)
df[, categorical_cols] <- lapply(df[, categorical_cols], function(x) ifelse(is.na(x), "Unknown", x))
df <- df[df$Electric.Range > 0, ]

##sampling 
library(dplyr)
sample_size <- 40000
sampled_df <- df %>% sample_n(sample_size)

Model Fitting

(click to expand)
Plotting the LOESS smoothed curve with span=0.25

Plotting the LOESS smoothed curve with span=0.50

Plotting the LOESS smoothed curve with span=0.75
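The effect of the span parameter can be reproduced with base R's loess() function. This is a self-contained sketch on simulated data standing in for Model Year and Electric Range, not the dashboard's actual plotting code:

```r
set.seed(42)
x <- sort(runif(300, 1997, 2024))                          # stand-in for Model.Year
y <- 100 + 40 * sin((x - 1997) / 6) + rnorm(300, sd = 15)  # stand-in for Electric.Range

spans <- c(0.25, 0.50, 0.75)
cols  <- c("red", "blue", "darkgreen")

plot(x, y, col = "grey60", pch = 16,
     xlab = "Model Year", ylab = "Electric Range",
     main = "LOESS fits with different spans")
for (i in seq_along(spans)) {
  fit <- loess(y ~ x, span = spans[i])
  lines(x, predict(fit), col = cols[i], lwd = 2)  # smaller span = wigglier curve
}
legend("topleft", legend = paste("span =", spans), col = cols, lwd = 2)
```

Each local fit uses the fraction of points given by span, so 0.25 tracks local fluctuations while 0.75 smooths toward the broad trend.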

Findings and Discussion

  1. Analysis: Based on the plots, there seems to be a general increasing trend in electric range starting from the earliest model year shown until around 2012, where it reaches a peak. Around 2020, the electric range hits its maximum value on the smooth curve, suggesting that models from around that year had the highest electric range on average. After the peak, there’s a sharp decline in the electric range, which could suggest a change in production models, a shift in manufacturer priorities, or possibly the introduction of many new models with lower ranges. There is considerable variability in the data points, especially in the later years, where the electric range varies widely across different models. There is also an indication of potential outliers as there are points that seem to deviate significantly from the majority of the data, especially in the later years, which could be outliers or special types of vehicles such as high-performance models.

  2. Model Fit: The LOESS curve fits the data points in the middle range well. But the fit is less accurate at the extreme ends, particularly in the latest years, where there is more scatter away from the curve. The curve appears to have a moderate span, as it smooths out much of the variability without overfitting. It captures the overall trend without following every small variation in the data points.

  3. Different Spans: As expected, the plot with span of 0.75 had a smoother curve, as it takes into account a larger fraction of points in each local regression, leading to a more general trend with less sensitivity to local fluctuations. In comparison to the other two, A span of 0.25 will produce the least smooth curve, as it considers a smaller fraction of points, allowing the curve to follow local variations more closely. The plot with a higher span will highlight broader trends in the data over time, possibly showing a general increase or decrease in electric range. The plot with a lower span will show more of the short-term fluctuations or variations within the data, which could be important for understanding yearly or model-specific changes in electric range.

In summary, the plot indicates an improvement in electric vehicle range up to around 2020, with a subsequent unexpected decline. It also highlights the variability in electric ranges across models, especially in more recent years. Therefore, the plots do answer the research question by illustrating the nature of the relationship between Model Year and Electric Range and demonstrating how this relationship has changed over time. The LOESS curve effectively captures the central trend despite the individual variabilities. And the dispersion of points around the curve in recent years highlights that the relationship is not strictly linear and may be influenced by other factors not captured.

MLR Dashboard

Introduction

The MLR (Multiple Linear Regression) Dashboard is designed to provide comprehensive insights into the factors influencing electric vehicle (EV) adoption within Washington state, USA. This tool utilizes a robust dataset comprising 177,866 unique records spanning EVs manufactured from 1997 to 2024, encompassing variables such as geographic data, vehicle specifics, registration details, legislative context, and utility information.

Data Preparation

Data preparation steps
# read.csv converts spaces in column names to dots (e.g., Electric.Range),
# matching the column names used throughout the analysis
df <- read.csv("/Users/jitpapneja/Google Drive/TCCC Old/Personal/Electric_Vehicle_Population_Data (1).csv")

# Count missing values before imputation
missing_values <- sum(is.na(df))

# Impute missing numeric values with the column mean
numeric_cols <- sapply(df, is.numeric)
df[numeric_cols] <- lapply(df[numeric_cols], function(x) ifelse(is.na(x), mean(x, na.rm = TRUE), x))

# Replace missing categorical values with "Unknown"
categorical_cols <- sapply(df, is.character)
df[categorical_cols] <- lapply(df[categorical_cols], function(x) ifelse(is.na(x), "Unknown", x))

# Keep only rows with a positive electric range
df <- df[df$Electric.Range > 0, ]

Model Training

Model performance metrics
# caret provides createDataPartition() for the train/test split
library(caret)

target <- 'Electric.Range'
predictors <- c('Model.Year', 'Make', 'Model', 'Electric.Vehicle.Type',
                'Clean.Alternative.Fuel.Vehicle', 'Base.MSRP', 'Legislative.District')

# Treat every predictor (including Model.Year and Base.MSRP) as categorical
df[predictors] <- lapply(df[predictors], factor)

# Drop three influential observations identified during diagnostics
df_cleaned <- df[-c(18916, 62033, 66313), ]

set.seed(123) # For reproducibility
training_index <- createDataPartition(df_cleaned[[target]], p = 0.8, list = FALSE)
train_data <- df_cleaned[training_index, ]
test_data <- df_cleaned[-training_index, ]

# Build the regression formula and fit the model
formula <- as.formula(paste(target, '~', paste(predictors, collapse = ' + ')))
mlr_model <- lm(formula, data = train_data)
summary_mlr <- summary(mlr_model)

# Evaluate on the held-out test set
predictions <- predict(mlr_model, test_data)
mse <- mean((test_data[[target]] - predictions)^2)
r2 <- summary_mlr$r.squared
cat("MSE: ", mse)

MSE: 281.423

cat("R^2: ", r2)

R^2: 0.9720977

Model Summary

Model summary

Call:

lm(formula = formula, data = train_data)

Residuals:

  • Min: -80.650
  • 1Q: -6.811
  • Median: -0.851
  • 3Q: 5.720
  • Max: 66.094

Coefficients (Snippet):

Below is a snippet of the coefficients from the model.

Term Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.078469 16.950329 0.359 0.719892
Model.Year1998 13.203620 23.492188 0.562 0.574089
Model.Year1999 29.386572 18.205588 1.614 0.106499
Model.Year2000 13.503411 18.204538 0.742 0.458235

Model Fit:

  • R-squared: 0.9721
  • Adjusted R-squared: 0.972
  • F-statistic: 1.768e+04 on 135 and 68596 DF
  • p-value: < 2.2e-16

Key Findings:

  • Coefficient Estimates: Because the predictors enter the model as factors, each coefficient represents the estimated difference in Electric Range between that level and the reference level, holding all other variables constant. For example, the coefficient of 13.20 for Model.Year1998 suggests that, on average, EVs manufactured in 1998 have an Electric Range approximately 13.20 units higher than those in the reference model year.

  • Model Fit: The R-squared value of 0.9721 indicates that approximately 97.21% of the variance in Electric Range is explained by the predictors included in the model, suggesting that the selected predictors collectively provide a robust explanation of Electric Range variation in EVs.
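Because the predictors are factors, each dummy coefficient is an offset from a reference level rather than a per-unit slope. A minimal sketch with made-up numbers:

```r
# Illustrative only: factor coefficients in lm() are contrasts against the
# reference level (the first factor level), not per-unit slopes
yr  <- factor(c(1997, 1997, 1998, 1998))
rng <- c(10, 12, 24, 26)   # group means: 11 (1997) and 25 (1998)
fit <- lm(rng ~ yr)
coef(fit)
# (Intercept) = 11, the mean for the reference year 1997;
# yr1998      = 14, the 1998-vs-1997 difference (25 - 11)
```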

Model Performance and Diagnostic Plots

Model performance metrics

integer(0) integer(0) (no observations exceeded the Cook's distance or leverage thresholds)

Key Findings:

Assuming the thresholds below:

  • Cook's distance: a common threshold is 1; values exceeding 1 suggest potentially influential points.

  • Leverage values: leverage (hat) values range from 0 to 1, with higher values indicating higher leverage. A common threshold for high-leverage points is 2 * (p + 1) / n, where p is the number of predictors and n is the sample size.

  • Studentized residuals: a common threshold for identifying outliers is ±3; observations whose absolute studentized residuals exceed 3 are considered potential outliers.

  1. Leverage Plot: Most observations have hat values that cluster near zero, suggesting they do not strongly influence the regression estimates. A few points stand out from the cluster with larger hat values; these may be outliers and could have a substantial impact on the slope.

  2. Residuals vs Fitted Plot: The spread of residuals appears to have non-constant variance, with a pattern that could suggest a non-linear relationship. There are also a few outliers that could affect the model's accuracy.

  3. Q-Q Plot: The plot does not perfectly follow the normal distribution. There is a deviation from the line in the tails, especially the upper tail, indicating the presence of outliers or extreme values.

  4. Scale-Location Plot: The spread of residuals is not uniform across the range of fitted values, which could imply that the variance of the residuals is not constant. Some points show higher spread at the higher end of fitted values.

  5. Residuals vs Leverage Plot: The Cook's distance lines suggest that some points are potentially influential.
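The threshold checks can also be computed numerically rather than read off the plots. This sketch uses a small simulated regression; the dashboard would apply the same calls to mlr_model.

```r
# Sketch: flagging influential observations with the stated thresholds.
# The simulated data plant one point (index 51) that is both a high-leverage
# point and an outlier, so every rule should flag it.
set.seed(1)
x <- c(rnorm(50), 10)                 # x = 10 sits far from the bulk of the data
y <- c(2 * x[1:50] + rnorm(50), 60)   # ...and its y breaks the y = 2x pattern
fit <- lm(y ~ x)

n <- length(y)
p <- 1                                # number of predictors

cooks_flag    <- which(cooks.distance(fit) > 1)           # Cook's distance > 1
leverage_flag <- which(hatvalues(fit) > 2 * (p + 1) / n)  # hat value > 2(p+1)/n
outlier_flag  <- which(abs(rstudent(fit)) > 3)            # |studentized resid| > 3
```

A which(...) call returning integer(0), as in the dashboard output above, simply means that no observation crossed the corresponding threshold.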

Enhancing the model

Given the number of outliers flagged above, we examined how the model changes after removing them.
# Remove flagged outliers; `outliers` is assumed to hold the row indices
# identified by the diagnostics (e.g., which(abs(rstudent(mlr_model)) > 3))
train_data_clean <- train_data[-outliers, ]
mlr_model_clean <- lm(formula, data = train_data_clean)
summary_obj <- summary(mlr_model_clean)
f_value <- summary_obj$fstatistic["value"]
f_df1 <- as.integer(summary_obj$fstatistic["numdf"])  # coerce for sprintf("%d", ...)
f_df2 <- as.integer(summary_obj$fstatistic["dendf"])
p_value <- pf(f_value, f_df1, f_df2, lower.tail = FALSE)
model_stats_clean_df <- data.frame(
  Metric = c("Residual Standard Error", "Degrees of Freedom", "R-squared", "Adjusted R-squared", "F-statistic", "p-value"),
  Value = c(
    sprintf("%.2f on %d degrees of freedom", sigma(mlr_model_clean), df.residual(mlr_model_clean)),
    NA, 
    sprintf("%.4f", summary(mlr_model_clean)$r.squared),
    sprintf("%.4f", summary(mlr_model_clean)$adj.r.squared),
    sprintf("%.2e on %d and %d DF", f_value, f_df1, f_df2),
    format.pval(p_value, digits = 2) 
  )
)
model_stats_clean_df <- na.omit(model_stats_clean_df)
library(knitr)
kable(model_stats_clean_df, format = "markdown", col.names = c("Metric", "Value"), align = 'l')
Metric                    Value
Residual Standard Error   14.30 on 67665 degrees of freedom
R-squared                 0.9793
Adjusted R-squared        0.9793
F-statistic               1.74e+04 on 184 and 67665 DF
p-value                   <2e-16
plot(mlr_model_clean)

model_distribution <- table(df$Model)

Key Findings: After removing the outliers, we unfortunately do not see much of a change except for some sparseness in the data points. However, there are a few minor changes in the metrics.

Comparing the two model fits, we can observe the following:

Coefficients: Upon expanding the summary(mlr_model_clean), we see that a lot of the coefficients with NAs have been removed after removing the outliers as opposed to before.

R-squared: The previous model fit had an R-squared value of 0.9721. In the current model fit, after removing outliers, the R-squared value increased to 0.9793, indicating that approximately 97.93% of the variability is now explained by the independent variables. This suggests a slight improvement in the model’s ability to explain the variation in the dependent variable.

Adjusted R-squared: In both fits the adjusted R-squared closely tracks the R-squared: 0.972 in the previous fit and 0.9793 after removing outliers. Since the set of predictors did not change, the improvement comes from removing the outliers rather than from added explanatory terms.
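As a sanity check, adjusted R-squared can be recomputed from R-squared and the degrees of freedom. With 184 model terms and 67,665 residual degrees of freedom (so n = 67,850), the penalty is tiny, which is why the reported R-squared and adjusted R-squared are nearly identical:

```r
# Adjusted R-squared from R-squared, sample size n, and number of model terms p
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Values taken from the cleaned model's summary table above
adj_r2(0.9793, n = 67850, p = 184)   # ~0.9792: essentially unchanged, since n >> p
```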

F-statistic: The F-statistic assesses the overall significance of the model. In the previous model fit, the F-statistic was 1.768e+04, while in the current model fit, it is slightly lower at 1.74e+04.

Overall, the current model fit, after removing outliers, shows a slightly higher R-squared value, indicating a better fit to the data. The adjusted R-squared, F-statistic, and p-value remain similar between the two fits, suggesting a consistent and significant model. The output still shows several outliers in the data, but no high-leverage or influential points meet the specified threshold criteria, and the diagnostic plots have not changed much either.

Note: We tried to check for multicollinearity using the VIF, but vif() raised an error about aliased coefficients. To fix it, we looked for the aliased pair, yet alias() reported none. We suspect near-perfect multicollinearity among some of the independent variables: correlations high enough to approach the numerical tolerance used in the VIF computation without producing exact aliases.
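The aliasing error is easy to reproduce with exactly collinear predictors; in our data the collinearity is apparently near-exact rather than exact, which is why alias() found nothing while vif() still complained. A minimal sketch (the vif() call needs the car package, so it is shown commented out):

```r
# Sketch: exact collinearity makes lm() drop a coefficient (NA), which is
# exactly the situation car::vif() refuses to work with
set.seed(7)
x1 <- 1:10
x2 <- 2 * x1                # exact linear copy of x1
y  <- x1 + rnorm(10)
fit <- lm(y ~ x1 + x2)

coef(fit)["x2"]             # NA: x2 was dropped as aliased
alias(fit)                  # reports the linear dependency x2 = 2 * x1
# car::vif(fit)             # would error about aliased coefficients (needs car)
```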

Further enhancing the model using stepwise model selection

To identify which predictors could safely be removed, we performed stepwise model selection.
# More data cleaning along the way: identify sparse levels (model names that
# appear at most once) in model_distribution and drop the corresponding rows
sparse_levels <- names(model_distribution[model_distribution <= 1])
df$Model_Combined <- ifelse(df$Model %in% sparse_levels, 'Other', as.character(df$Model))
df$Model_Combined <- factor(df$Model_Combined)
df <- df[!df$Model %in% sparse_levels, ]

#StepWise 
library(MASS)
# Initial model fitting
initial_formula <- as.formula(paste(target, "~", paste(predictors, collapse = " + ")))
initial_model <- lm(initial_formula, data = train_data)
step_model <- stepAIC(initial_model, direction = "both")

Start: AIC=386320.7
Electric.Range ~ Model.Year + Make + Model + Electric.Vehicle.Type + Clean.Alternative.Fuel.Vehicle + Base.MSRP + Legislative.District

Step: AIC=386320.7
Electric.Range ~ Model.Year + Model + Electric.Vehicle.Type + Clean.Alternative.Fuel.Vehicle + Base.MSRP + Legislative.District

                                 Df Sum of Sq      RSS    AIC
- Legislative.District           49     25896 18899390 386317
<none>                                        18873494 386321
- Base.MSRP                      20    475307 19348801 387990
- Clean.Alternative.Fuel.Vehicle  1   1128737 20002230 390311
- Electric.Vehicle.Type           1   4920090 23793584 402241
- Model.Year                     17  24505452 43378945 443487
- Model                          84  81124025 99997519 500756

Step: AIC=386317
Electric.Range ~ Model.Year + Model + Electric.Vehicle.Type + Clean.Alternative.Fuel.Vehicle + Base.MSRP

                                 Df Sum of Sq       RSS    AIC
<none>                                         18899390 386317
+ Legislative.District           49     25896  18873494 386321
- Base.MSRP                      20    474303  19373693 387981
- Clean.Alternative.Fuel.Vehicle  1   1120882  20020272 390275
- Electric.Vehicle.Type           1   4955720  23855110 402321
- Model.Year                     17  24522923  43422313 443457
- Model                          84  82268999 101168389 501458

plot(step_model)

model_summary <- summary(step_model)

Final Model Selection using log transformation

Based on the output above, we concluded that removing a few more predictors could help the model. Taking the earlier warnings into account, we also removed six high-leverage observations from the training data and refit the model, as seen below. In addition, we applied a log transformation to the target variable. Reducing the number of predictors simplifies the model, and the log transformation helps address skewness and non-linearity in the target variable, leading to better modeling results.
new_predictors <- c("Model.Year", "Make", "Model", "Electric.Vehicle.Type", "Base.MSRP")

# Remove the high-leverage observations identified in the diagnostics
to_remove <- c(18910, 24399, 62007, 62619, 64883, 66291)
train_data_cleaned <- train_data[-to_remove, ]

# Log-transform the target inside the formula only (transforming the column
# as well would apply the log twice) and use the reduced predictor set
new_formula <- as.formula(paste("log(", target, ") ~", paste(new_predictors, collapse = ' + ')))
final_model_cleaned <- lm(formula = new_formula, data = train_data_cleaned)
plot(final_model_cleaned)

Model Summary

Model summary

Call:

lm(formula = new_formula, data = train_data_cleaned)

Residuals:

  • Min: -0.281519
  • 1Q: -0.011846
  • Median: -0.000927
  • 3Q: 0.011753
  • Max: 0.222713

Coefficients (Snippet):

Below is a snippet of the coefficients from the model.

Term Estimate Std. Error t value p-value
(Intercept) 6.078469 16.950329 0.359 0.719892
Model.Year1998 13.203620 23.492188 0.562 0.574089
Model.Year1999 29.386572 18.205588 1.614 0.106499
Model.Year2000 13.503411 18.204538 0.742 0.458235

Model Fit:

  • R-squared: 0.99
  • Adjusted R-squared: 0.9899
  • F-statistic: 3.802e+04 on 178 and 68547 DF
  • p-value: < 2.2e-16

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.02589 on 68547 degrees of freedom

Key Findings:

The new model includes predictors such as Model.Year, Model, Electric.Vehicle.Type, and Base.MSRP. The coefficients for these predictors suggest varying levels of influence on the dependent variable Electric.Range. For instance, certain model years show a negative association, implying that electric range decreases compared to the baseline year. In contrast, other years, like Model.Year2020, show a significant positive effect, indicating a higher electric range. The Electric.Vehicle.TypePlug-in Hybrid Electric Vehicle (PHEV) predictor has a substantial negative coefficient, which means that compared to all-electric vehicles, plug-in hybrids have a significantly lower electric range. The Base.MSRP variable has a varied effect on the electric range. Certain MSRP values are associated with a significant increase or decrease in the electric range, indicating that price may correlate with range capabilities.

The model has a very high R-squared value of 0.99, implying that approximately 99.0% of the variability in the electric range is explained by the model's predictors. The adjusted R-squared is nearly as high at 0.9899, showing that the model fits the data well and that the included predictors contribute to its explanatory power. The F-statistic is extremely large with a practically zero p-value, providing strong evidence against the null hypothesis: the model is statistically significant, and there is a relationship between the predictors and the electric range. Despite the high R-squared values, however, the assumptions are not fully met, as the diagnostics below show.
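With a log-transformed response, raw coefficients are changes in log(Electric.Range), so they are easiest to read after exponentiating. The coefficient values below are illustrative, not taken from the fitted model:

```r
# A coefficient b on the log scale corresponds to a multiplicative effect of
# exp(b), i.e. roughly a (exp(b) - 1) * 100 percent change in Electric.Range
# per level shift, holding the other predictors fixed
pct_change <- function(b) (exp(b) - 1) * 100

pct_change(0.05)    # +0.05 on the log scale is about a +5.1% change in range
pct_change(-0.10)   # -0.10 on the log scale is about a -9.5% change
```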

Checking Assumptions:

  1. The Residuals vs Leverage plot: A few points fall outside the Cook's distance lines, which suggests the presence of influential observations. These points could disproportionately influence the model's predictions.

  2. The Q-Q plot: Most points follow the line, but there is a clear deviation at both tails, indicating that the residuals have heavier tails than a normal distribution. This suggests that the normality assumption may be violated.

  3. The Scale-Location plot: The plot shows a somewhat uniform spread of residuals across the range of fitted values with a slightly increasing trend, suggesting that variance may increase with the fitted values and it may indicate potential non-constant variance.

  4. The Residuals vs Fitted Plot: The residuals are centered around the horizontal line at zero, which is good. There may be a slight pattern suggesting some non-constant variance (heteroscedasticity), and a few outliers remain, but since the red line is relatively straight, this assumption arguably holds.

The model shows some indications that key assumptions are violated, but after the many changes already made to the data there is little more to be done: the summary still shows a robust model with a strong fit. The presence of many outliers, however, remains a concern for the validity of the model's inferences.